Automatic performance debugging of parallel applications includes two main steps: locating performance bottlenecks and uncovering their root causes for performance optimization. Previous work fails to resolve this challenging issue in two ways: first, several previous efforts automate locating bottlenecks, but present results in a confined way that only identifies performance problems with a priori knowledge; second, several tools take exploratory or confirmatory data analysis to automatically discover relevant performance data relationships, but these efforts do not focus on locating performance bottlenecks or uncovering their root causes.

The single program, multiple data (SPMD) programming model is widely used for both high performance computing and Cloud computing. In this paper, we design and implement an innovative system, AutoAnalyzer, that automates the process of debugging performance problems of SPMD-style parallel programs, including data collection, performance behavior analysis, locating bottlenecks, and uncovering their root causes. AutoAnalyzer is unique in terms of two features: first, without any a priori knowledge, it automatically locates bottlenecks and uncovers their root causes for performance optimization; second, it is lightweight in terms of the size of performance data to be collected and analyzed. Our contributions are three-fold: first, we propose two effective clustering algorithms to investigate the existence of performance bottlenecks that cause process behavior dissimilarity or code region behavior disparity, respectively; meanwhile, we present two searching algorithms to locate bottlenecks; second, on the basis of rough set theory, we propose an innovative approach to automatically uncovering root causes of bottlenecks; third, on cluster systems with two different configurations, we use two production applications, written in Fortran 77, and one open source code, MPIBZIP2 (http://compression.ca/mpibzip2/), written in C++, to verify the effectiveness and correctness of our methods. For the three applications, we also propose an experimental approach to investigating the effects of different metrics on locating bottlenecks.

Keywords: SPMD parallel programs, automatic performance debugging, performance bottleneck, root cause analysis, performance optimization

1. Introduction

How to improve the efficiency of parallel programs is a 
challenging issue for programmers, especially non-experts 
without the deep knowledge of computer science, and 
hence it is a crucial task to develop an automatic per- 
formance debugging tool to help application programmers 
analyze parallel programs' behavior, locate performance 
bottlenecks (in short bottlenecks), and uncover their root 
causes for performance optimization. 

Although several existing tools can automate analysis processes to some extent, previous work fails to resolve this issue in three ways. First, with traditional performance debugging tools, though data collection processes are often automated, detecting bottlenecks and uncovering their root causes need great manual efforts. Second, several previous efforts can only automatically identify critical bottlenecks with a priori knowledge specified in terms of either the execution patterns that represent situations of inefficient behaviors [1], or the predefined performance hypotheses/thresholds [7] [9], or the decision tree classification trained by microbenchmarks [26]. Third, while a lot of existing tools [3] take exploratory or confirmatory data analysis approaches to automatically discovering relationships of relevant performance data, these efforts do not focus on locating performance bottlenecks and uncovering their root causes.

The SPMD [43] programming model is widely used for high performance computing [45]. Recently, for instance [44], MapReduce-like techniques [48] also promote the wide use of the SPMD programming model in Cloud computing [45] [47] [48]. This paper focuses on how to automate the process of debugging performance problems of SPMD-style programs, including collecting performance data, analyzing application behavior, detecting bottlenecks, and uncovering their root causes, but not including performance optimization. To that end, we design and implement an innovative system, AutoAnalyzer.

Without human involvement, our tool uses source-to- 
source transformation to automatically insert the instru- 
mentation code into the source code of a parallel program, 
and divide the whole program into code regions, each of 
which is a section of code executed from start to finish 
with one entry and one exit. For a SPMD-style parallel 
program, if we exclude code regions in the master pro- 
cess responsible for the management routines, each pro- 
cess or thread should have similar behavior. At the same 
time, if a code region takes up a trivial proportion of a program's running time, the performance improvement of the code region will contribute little to the overall performance of the program. From the above intuition, in this paper we pay attention to two types of performance bottlenecks: bottlenecks that cause process or thread behavior dissimilarity, which we call dissimilarity bottlenecks, and bottlenecks that cause code region behavior disparity (significantly different contributions of code regions to the overall performance), which we call disparity bottlenecks. After collecting the performance data of code regions from four hierarchies (application, parallel interface, operating system, and hardware), AutoAnalyzer applies a series of innovative approaches to search for code regions that are dissimilarity and disparity bottlenecks and to uncover their root causes for performance optimization. Our
contributions are summarized as follows:

• For SPMD-style parallel applications, we utilize two effective clustering algorithms to investigate the existence of performance bottlenecks that cause process behavior dissimilarity or code region behavior disparity, respectively; if there are bottlenecks, we present two searching algorithms to locate performance bottlenecks.

• On the basis of rough set theory, we propose an innovative approach to automatically uncovering root causes of bottlenecks.

• We design and implement AutoAnalyzer. On cluster systems with two different configurations, we use two production applications and one open source code, MPIBZIP2, to verify the effectiveness and correctness of our system. We also investigate the effects of different metrics on locating bottlenecks: our experimental results show that, for the three applications, our proposed metrics outperform the cycles per instruction (CPI) and the wall clock time in terms of locating disparity bottlenecks.

The rest of this paper is organized as follows: Section 2 formulates the problem. Section 3 outlines the related work, followed by the description of our solution in Section 4. The implementation and evaluation of AutoAnalyzer are depicted in Section 5 and Section 6, respectively. Finally, concluding remarks are listed in Section 7.

2. Problem Statement

A code region is a section of code that is executed from start to finish with one entry and one exit. A code region can be a function, subroutine or loop, which can be nested within another one. After dividing the whole program into $n$ code regions $CR_j$, $j = 1, \ldots, n$, we organize them as a tree structure with the whole program as the root. According to the definition of the tree structure, for any node $CR_j$, its depth is the length of the path from the root to $CR_j$. For example, in Fig. 1, the depth of code region 1 is one. We call a code region of depth $L$ an L-code region.

In our system, to accurately measure the contribution of each code region to the overall performance of the program, we require that code regions that have the same depth cannot overlap. For code regions with different depths, we encourage the nesting of code regions because deep nesting leads to fine granularity, which is helpful in narrowing the scope of the source code in locating bottlenecks. For example, in Fig. 1, the two 1-code regions, code region 1 and code region 2, do not intersect. The two child nodes of code region 1, code region 4 and code region 6, are nested within it.

Figure 1: The code region tree of a parallel program.
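The code region tree and the notion of an L-code region can be illustrated with a short Python sketch. This is not AutoAnalyzer's implementation: the class `CodeRegion`, the helper `regions_by_depth`, and the overall tree shape are assumptions made for illustration; only the facts that code regions 1 and 2 have depth one and that code regions 4 and 6 are children of code region 1 are taken from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class CodeRegion:
    """One node of the code region tree (hypothetical structure)."""
    region_id: int
    parent: Optional["CodeRegion"] = field(default=None, repr=False)
    children: List["CodeRegion"] = field(default_factory=list)

    def add_child(self, region_id: int) -> "CodeRegion":
        child = CodeRegion(region_id, parent=self)
        self.children.append(child)
        return child

    @property
    def depth(self) -> int:
        # Depth is the length of the path from the root (the whole program).
        return 0 if self.parent is None else self.parent.depth + 1


def regions_by_depth(root: CodeRegion) -> Dict[int, List[int]]:
    """Group region IDs by depth, so that key L lists the L-code regions."""
    groups: Dict[int, List[int]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        groups.setdefault(node.depth, []).append(node.region_id)
        stack.extend(node.children)
    return {d: sorted(ids) for d, ids in groups.items()}


# Region 0 stands for the whole program (the root); the shape is illustrative.
program = CodeRegion(0)
region1 = program.add_child(1)
region2 = program.add_child(2)
region4 = region1.add_child(4)
region6 = region1.add_child(6)
```

With this tree, `regions_by_depth(program)[1]` lists the 1-code regions, which by the non-overlap requirement must not intersect each other.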

For a parallel program, if its processes or threads have similar behavior, performance vectors of all processes or threads should be classified into one cluster; otherwise, there are dissimilarity bottlenecks, indicating load imbalance. For each code region, if we average its performance data among all processes or threads, we can measure its contribution to the overall performance. We identify a code region that takes up a significant proportion of a program's running time and has the potential for performance improvement as a disparity bottleneck. Of course, we cannot exhaust all types of bottlenecks, since users always want programs to run faster.

Our work focuses on how to automatically locate dissim- 
ilarity and disparity bottlenecks, and uncover their root 
causes for performance optimization. However, automatic 
performance optimization is not our target. 

3. Related Work 

Table 1: The comparison of different systems. Yes indicates it is automatic; else not.

Table 1 summarizes the differences of the related systems from five perspectives: data collection, behavior analysis, bottleneck detection, uncovering root causes and performance optimization. Hollingsworth et al. [32] propose a plan to develop a test suite for verifying the effectiveness of different tools in terms of locating performance bottlenecks. If it succeeds, this test suite can provide a benchmark for evaluating the accuracy of different tools in locating bottlenecks, in terms of false positives and false negatives. Unfortunately, this project seems to have ended without updating its web site.

The traditional approach for performance debugging is through automated data collection and visualization of performance data, while performance analysis and code optimization need great manual efforts. With this approach, application programmers need to learn appropriate tools, and rely on their expertise to interpret data and its relation to the code so as to optimize the code. For example, HPCViewer, HPCTOOLKIT [40], and TAU [36] display the performance metrics through a graphical user interface. Users depend on their expertise to choose valuable data, which is hard and tedious.



With a priori knowledge, previous work proposes several automatic analysis solutions to identify critical bottlenecks. The EXPERT system describes performance problems using a high level of abstraction in terms of execution patterns that result from an inefficient use of the underlying programming models, and performs trace data analysis using an automated pattern-matching approach. The Paradyn parallel performance tool starts searching for bottlenecks by issuing instrumentation requests to collect data for a set of pre-defined performance hypotheses for the whole program. Paradyn compares the collected performance data with the predefined thresholds, and the instances where the measured value for the hypothesis exceeds the threshold are defined as bottlenecks [9]. Paradyn then starts a hierarchical search of the bottlenecks, and refines this search by using stack sampling and by pruning the search space through considering the behavior of the application during previous runs [8]. Using a decision tree classification, which is trained by microbenchmarks that demonstrate both efficient and inefficient communication, Vetter et al. [26] automatically classify individual communication operations, and reveal the cause of communication inefficiencies in the application. The Aksum tool [33] automatically performs multiple runs of a parallel application and detects performance bottlenecks by comparing the performance achieved when varying the problem size and the number of allocated processors. Other work extracts performance knowledge from parallel design patterns or models that represent structural and communication patterns of a program for performance diagnosis.

Several previous efforts propose exploratory or confirmatory data analysis [24] or fuzzy set methods for automated discovery of relevant performance data. The PerfExplorer tool [21] addresses the need to manage large-scale data complexity using techniques such as clustering and dimensionality reduction, and performs automated discovery of relevant data relationships using comparative and correlation analysis techniques. By clustering thread performance for different metrics, PerfExplorer can discover these relationships and which metrics best distinguish their differences. Calzarossa et al. propose a top-down methodology towards automatic performance analysis of parallel applications: first, they focus on the overall behavior of the application in terms of its activities, and then they consider individual code regions and activities performed within each code region. Calzarossa et al. [15] utilize clustering techniques to summarize and interpret the performance information by identifying patterns or groups of code regions characterized by a similar behavior. Ahn et al. use several multivariate statistical analysis techniques to analyze parallel performance behavior, including cluster analysis and F-ratio, factor analysis, and principal component analysis. They also show how hardware counters could be used to analyze the performance of multiprocessor parallel machines. The primary goal of the SimPoint system [31] is to reduce long-running applications down to tractable simulations. Sherwood et al. [31] define the concept of basic block vectors, and use them to define the behavior of blocks of execution, usually one million instructions at a time. Truong et al. [28] [33] propose a fuzzy set approach to search for bottlenecks. However, it does not intend to uncover the root causes of bottlenecks. Tallent et al. propose approaches to measure and attribute parallel idleness and parallel overhead of multi-threaded parallel applications.

Tiwari et al. describe a scalable and general-purpose framework for auto-tuning compiler-generated code, which generates in parallel a set of alternative implementations of computation kernels and automatically selects the best-performing implementation. Tu et al. [41] propose a new parallel computation model to characterize the performance effects of the memory hierarchy on multi-core clusters in both vertical and horizontal levels. Babu et al. make a case for techniques to automate the setting of tuning parameters for MapReduce programs. Zhang et al. [27] propose a precise request tracing approach to debug performance problems of multi-tier services of black boxes.

Our system has two distinguishing differences from other systems, as shown in Table 1: first, in addition to automatic performance behavior analysis, we automatically locate bottlenecks of SPMD-style parallel programs without a priori knowledge; second, we automatically uncover the root causes of bottlenecks for performance optimization. With regard to proposing performance vectors to represent the behavior of a parallel application, AutoAnalyzer is similar to the work in [15] [19] [21] [29], but we investigate the effect of different metrics on locating bottlenecks. Different from PerfExplorer [21], which leverages sophisticated clustering techniques, AutoAnalyzer adopts comparatively simple clustering algorithms, which are lightweight in terms of the size of performance data to be collected and analyzed.

4. Our Solution 

This section includes four parts: Section 4.1 summarizes our approach, followed by the description of the approaches to investigating the existence of bottlenecks in Section 4.2. How to locate bottlenecks is described in Section 4.3. Finally, we propose an approach to uncovering the root causes of bottlenecks in Section 4.4.

4.1. Summary of our approach

Our method includes four major steps: instrumentation, 
collecting performance data, locating bottlenecks, and un- 
covering their root causes. 

First, we instrument a whole parallel program into code 
regions. Our tool uses source-to-source transformation to 
automatically insert instrumentation code into the source 
code, which requires no human involvement. 



Second, we collect performance data of code regions. For each process or thread, we collect the following performance data of code regions: (1) application-level performance data: wall clock time and CPU clock time; (2) hardware counter performance data: clock cycles, instructions retired, L1 cache misses, L2 cache misses, L1 cache accesses, and L2 cache accesses; (3) communication performance data: MPI communication time (the execution time in the MPI library) and MPI communication quantity (the quantity of data transferred by the MPI library); (4) operating system level performance data: disk I/O quantity (the quantity of data read and written by disk I/O). On the basis of the hardware counter performance data, we obtain two derived metrics: L1 cache miss rate and L2 cache miss rate. For example, the L1 cache miss rate is obtained as (L1 cache misses) / (L1 cache accesses).
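As a small illustration of the derived metrics, the following Python sketch computes both miss rates from raw counters of one code region in one process. The record layout `HardwareCounters` and the choice to report a rate of 0.0 when a cache was never accessed are assumptions, not details given in the paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class HardwareCounters:
    """Raw counters for one code region in one process (hypothetical layout)."""
    cycles: int
    instructions: int
    l1_miss: int
    l1_access: int
    l2_miss: int
    l2_access: int


def miss_rate(misses: int, accesses: int) -> float:
    """Cache miss rate = misses / accesses; 0.0 if the cache was never accessed."""
    return misses / accesses if accesses else 0.0


def derived_metrics(c: HardwareCounters) -> Dict[str, float]:
    """Return the two derived metrics described in the text."""
    return {
        "l1_miss_rate": miss_rate(c.l1_miss, c.l1_access),
        "l2_miss_rate": miss_rate(c.l2_miss, c.l2_access),
    }
```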

Third, we utilize two clustering approaches to investigate the existence of bottlenecks. If there are bottlenecks, we use two searching algorithms to locate them.

Finally, on the basis of rough set theory, we present an approach to uncovering the root causes of bottlenecks.

4.2. Investigating the existence of bottlenecks

In this section, we present how to investigate the existence of dissimilarity bottlenecks and disparity bottlenecks, respectively.

4.2.1. The existence of dissimilarity bottlenecks 

For an SPMD program, each process or thread is composed of the same code regions. If we exclude code regions in the master process responsible for the management routines, the high behavior similarity of each process or thread indicates the balance of workload dispatching and resource utilization, and vice versa [15]. So we use a similarity analysis approach to investigate the existence of dissimilarity bottlenecks.

The performance similarity is analyzed among all participating processes or threads to discover the discrepancy. We presume that the whole program is divided into $n$ code regions, and the whole program has $m$ processes or threads. In our approach, the performance of each process or thread is represented by a vector $V_i$, where $i$ is the process or thread rank. $T_{ij}$ represents the performance measurement of the $j$th code region in the $i$th process or thread. So $V_i$ is described as $V_i = (T_{i1}, T_{i2}, \ldots, T_{in})$.

We define the Euclidean distance $Dist_{ij}$ of two vectors $V_i$ and $V_j$ in Equation (1).

$$Dist_{ij} = \sqrt{(T_{i1} - T_{j1})^2 + \cdots + (T_{in} - T_{jn})^2} \qquad (1)$$



We choose the CPU clock time of each code region as the main measurement. Different from the wall clock time, the CPU clock time only measures the time during which the processor is actively working on a certain task, while the wall clock time measures the total time for a process to complete. We also observe the effect of choosing a different metric, the wall clock time, on locating dissimilarity bottlenecks in the evaluation.
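The distance between the performance vectors of two processes can be written directly from its definition. The sketch below is illustrative only; the function name `distance` and the example CPU clock times are assumptions.

```python
import math
from typing import Sequence


def distance(v_i: Sequence[float], v_j: Sequence[float]) -> float:
    """Euclidean distance between two performance vectors of equal length."""
    if len(v_i) != len(v_j):
        raise ValueError("performance vectors must cover the same code regions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v_i, v_j)))


# Made-up CPU clock times (seconds) of three code regions in two processes.
v_0 = [12.0, 3.5, 0.8]
v_1 = [12.1, 3.4, 0.8]
gap = distance(v_0, v_1)
```

A small `gap` relative to the length of the vectors suggests that the two processes behave similarly.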

On the basis of Equation (1), we present a simplified OPTICS clustering method, Algorithm 1, to classify all processes or threads. We choose the simplified OPTICS clustering method because it has an advantage in discovering isolated points. In this approach, the performance vector of each process or thread is considered as a point in an $n$-dimensional space. A set of points is classified into one cluster if the point density in the area where these points are scattered is larger than the defined threshold. If a point is not included in any cluster, we consider it an isolated point, which also forms a new cluster.

Algorithm 1 The simplified OPTICS clustering algorithm

repeat
    select a performance vector V_p not belonging to any cluster.
    count = 0;
    for each point q (q ≠ p) in the n-dimensional space do
        if (distance(V_p, V_q) < threshold) then
            count++;
            // We set the threshold as 10% × length(V_p).
        end if
    end for
    if count > count threshold then
        confirm that this is a new cluster.
    end if
until all vectors are compared.

For an SPMD program, if Algorithm 1 classifies the performance vectors of all processes or threads into one cluster, indicating that all processes have similar performance behavior, we confirm that there are no dissimilarity bottlenecks; otherwise, there are dissimilarity bottlenecks.
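One possible reading of the simplified OPTICS clustering can be sketched in Python as follows. Algorithm 1 leaves several details open, so this is an interpretation rather than the authors' code: the function name `simplified_optics`, the default count threshold of 0, and the rule that the neighbours of a seed vector join its cluster are all assumptions. The 10% threshold relative to the length of $V_p$ is taken from the algorithm.

```python
import math
from typing import List, Sequence


def simplified_optics(vectors: Sequence[Sequence[float]],
                      ratio: float = 0.10,
                      count_threshold: int = 0) -> List[int]:
    """Return one cluster label per performance vector (labels start at 0)."""
    labels = [-1] * len(vectors)          # -1 means "not yet in any cluster"
    next_label = 0
    for p, v_p in enumerate(vectors):
        if labels[p] != -1:
            continue
        # The threshold is 10% of the length (Euclidean norm) of V_p.
        threshold = ratio * math.hypot(*v_p)
        neighbours = [q for q, v_q in enumerate(vectors)
                      if q != p and labels[q] == -1
                      and math.dist(v_p, v_q) < threshold]
        labels[p] = next_label
        if len(neighbours) > count_threshold:
            # Dense enough: the neighbours join the cluster seeded by V_p.
            for q in neighbours:
                labels[q] = next_label
        # Otherwise V_p stays an isolated point, which is a cluster of its own.
        next_label += 1
    return labels


def has_dissimilarity_bottleneck(vectors: Sequence[Sequence[float]]) -> bool:
    """More than one cluster means the processes do not behave alike."""
    return len(set(simplified_optics(vectors))) > 1
```

For four processes whose vectors are `[[10.0, 10.0], [10.2, 9.9], [10.1, 10.1], [20.0, 10.0]]`, the first three fall into one cluster and the last one is isolated, so the sketch reports a dissimilarity bottleneck.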

4.2.2. The existence of disparity bottlenecks

For each code region, if we average performance data 
among all processes or threads, we can measure its contri- 
bution to overall performance. We will identify a code re- 
gion that takes up a significant proportion of a program's 
running time and has the potential for performance im- 
provement as a disparity bottleneck. 

We propose a single normalized metric, named the code region normalized metric (in short, CRNM), as the measurement basis for the performance contribution of each code region to the overall performance of the application. For each code region, CRNM is defined in Equation (2).

$$CRNM = \frac{CRWT}{WPWT} \times CPI \qquad (2)$$

In Equation (2), CRWT is the wall clock time of the code region; WPWT is the wall clock time of the whole program; CPI is the average cycles per instruction of the code region. In Section 6.4, we also investigate the effects of choosing other metrics, e.g., CPI and wall clock time of each code region, on locating disparity bottlenecks.

As shown in Fig. 2, the procedure of searching for disparity bottlenecks is as follows:

First, for each process or thread, we obtain the CRNM value of each code region. If a code region is not on the call path in a process or thread, its CRNM value is zero. Since an SPMD program can contain if statements, so that different processes may execute different code regions, we obtain the average value of each code region among all processes or threads.
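The first step can be sketched as follows. The function names `crnm` and `average_crnm`, the tuple layout of a sample, and the numbers are assumptions for illustration; the formula and the rule that a region off the call path contributes zero come from the text.

```python
from typing import List, Optional, Tuple

# (CRWT in seconds, cycles, instructions) of one code region in one process;
# None means the region was not on the call path of that process.
Sample = Optional[Tuple[float, int, int]]


def crnm(crwt: float, wpwt: float, cycles: int, instructions: int) -> float:
    """CRNM = (CRWT / WPWT) * CPI, with CPI = cycles / instructions."""
    if wpwt <= 0 or instructions <= 0:
        raise ValueError("whole-program time and instruction count must be positive")
    return (crwt / wpwt) * (cycles / instructions)


def average_crnm(samples: List[Sample], wpwt: float) -> float:
    """Average the CRNM of one code region over all processes or threads."""
    values = [0.0 if s is None else crnm(s[0], wpwt, s[1], s[2]) for s in samples]
    return sum(values) / len(values)


# One code region observed in two processes; the second never executes it.
avg = average_crnm([(2.5, 300, 200), None], wpwt=10.0)
```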

Second, we use a k-means clustering method to classify each code region according to its average CRNM value. We choose the k-means clustering method because it can classify data into k clusters without the user providing a threshold value. We define five severity categories: very high (4), high (3), medium (2), low (1), and very low (0). The k-means clustering method finally classifies each code region into one of the severity categories according to its CRNM value.

Third, if a code region is classified into the severity category of very high or high, we consider it a critical code region (CCR).
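The second and third steps can be illustrated with a one-dimensional k-means (Lloyd's algorithm) in plain Python. The paper does not say how the centroids are initialised, so the evenly spaced start used here, the function name `kmeans_1d`, and the CRNM values are assumptions. Because the centroids start in increasing order and stay ordered in one dimension, the cluster index can be read directly as the severity category.

```python
from typing import List


def kmeans_1d(values: List[float], k: int = 5, iterations: int = 100) -> List[int]:
    """Cluster scalars into k groups; label 0 has the smallest centroid."""
    lo, hi = min(values), max(values)
    # Deterministic start: k centroids spread evenly over [lo, hi] (an assumption).
    centroids = [lo + (hi - lo) * c / (k - 1) for c in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, label in zip(values, labels) if label == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels


# Made-up average CRNM values, keyed by code region ID.
avg_crnm = {1: 0.01, 2: 0.02, 3: 0.25, 4: 0.50, 5: 0.75, 6: 1.00}
severity = dict(zip(avg_crnm, kmeans_1d(list(avg_crnm.values()))))
# Severity 3 (high) or 4 (very high) marks a critical code region.
critical = sorted(r for r, s in severity.items() if s >= 3)
```

With these values, code regions 5 and 6 end up in the high and very high categories and are therefore the CCRs.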
Figure 2: The k-means clustering approach.



4.3. Locating bottlenecks 

When users confirm that there are bottlenecks, they need to locate them. We call a code region that is a bottleneck a critical code region (in short, CCR). A CCR of depth $L$ is called an L-CCR. If a CCR satisfies either of the following conditions, we call it a core of critical code regions (in short, CCCR): (1) the CCR is a leaf node in the code region tree; (2) none of the children of the CCR is a CCR. For example, in Fig. 1, both code region 6 and code region 7 are CCCRs.

We propose a top-down searching algorithm, Algorithm 2, to locate dissimilarity bottlenecks as follows:

Algorithm 2 The searching algorithm for dissimilarity bottlenecks

n: the number of code regions;
r: the number of 1-code regions;
m: the number of processes or threads;

1. CCR_set = null;
2. CCCR_set = null;
3. for each code region j, j = 1...n do
4.     Tbackup_ij = T_ij, i = 1...m.
5.     if its depth is greater than one then
6.         T_ij = 0, i = 1...m.
7.     end if
8. end for
9. Obtain the clustering results.
10. for each code region j, j = 1...n do
11.     if its depth is equal to one then
12.         T_ij = 0, i = 1...m.
13.         Obtain the new clustering results.
14.         if the clustering result changes then
15.             Add code region j into CCR_set.
16.             Recursively analyze children of code region j.
17.             for each child code region k do
18.                 T_ik = Tbackup_ik, i = 1...m.
19.                 Obtain the new clustering results.
20.                 if the clustering result does not change then
21.                     Add code region k into CCR_set.
22.                     if (CCR k is a leaf node) or (none of its children is a CCR) then
23.                         CCR k is a CCCR.
24.                     end if
25.                 end if
26.             end for
27.             T_ij = Tbackup_ij, i = 1...m.
28.         end if
29.     end if
30. end for
31. if CCR_set is null then
32.     Combine s adjacent 1-code regions into composite code regions without overlapping, s > 2.
33.     Repeat the above analysis.
34.     if CCR_set is null and s < (r - 1) then
35.         increment s and repeat the above analysis.
36.     end if
37. end if

According to Lines 17-26 in Algorithm 2, a CCCR has a higher effect on the clustering results than the other children of its parent CCR, and hence we only consider CCCRs as dissimilarity bottlenecks, on which users should focus for performance optimization. If the number of clusters or the members of a cluster change, we consider that the clustering result changes; otherwise, it does not.
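The core test of the search, zeroing the measurements of one code region and checking whether the clustering changes, can be sketched for the 1-code regions only. This is a partial illustration under stated assumptions, not Algorithm 2 in full: the recursive analysis of children and the combination of adjacent regions are omitted, `cluster` is a simplified stand-in for Algorithm 1, and the function names and data are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cluster(vectors: Sequence[Sequence[float]], ratio: float = 0.10) -> List[int]:
    """Stand-in for Algorithm 1: vectors within ratio * |V_p| of a seed V_p share its label."""
    labels = [-1] * len(vectors)
    next_label = 0
    for p, v_p in enumerate(vectors):
        if labels[p] != -1:
            continue
        labels[p] = next_label
        threshold = ratio * math.hypot(*v_p)
        for q, v_q in enumerate(vectors):
            if labels[q] == -1 and math.dist(v_p, v_q) < threshold:
                labels[q] = next_label
        next_label += 1
    return labels


def depth1_ccrs(t: List[List[float]], depth: Dict[int, int]) -> List[int]:
    """Return the 1-code regions (column indices) whose removal changes the clustering.

    t[i][j] is the CPU clock time of code region j in process i, and
    depth[j] is the depth of code region j in the code region tree.
    """
    n = len(t[0])
    # Keep only the 1-code regions, then obtain the reference clustering.
    work = [[row[j] if depth[j] == 1 else 0.0 for j in range(n)] for row in t]
    reference = cluster(work)
    ccrs = []
    for j in range(n):
        if depth[j] != 1:
            continue
        saved = [row[j] for row in work]
        for row in work:
            row[j] = 0.0                  # zero region j in every process
        # Labels are assigned in order of first appearance, so equal label
        # lists mean the same number of clusters with the same members.
        if cluster(work) != reference:
            ccrs.append(j)
        for row, value in zip(work, saved):
            row[j] = value                # restore region j from the backup
    return ccrs


# Made-up CPU clock times: process 3 spends twice as long in region 0.
times = [[10.0, 5.0], [10.0, 5.0], [10.0, 5.0], [20.0, 5.0]]
found = depth1_ccrs(times, {0: 1, 1: 1})
```

In this example only region 0 is reported, because removing it makes all four processes fall into one cluster, whereas removing region 1 leaves the clustering unchanged.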

We also propose a simple searching algorithm to refine the scope of disparity bottlenecks as follows:

• If a leaf node $j$ is a CCR, then code region $j$ is a CCCR.

• For a non-leaf CCR $j$, if its severity degree is larger than that of each child node, then we consider code region $j$ a CCCR.
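The two rules above translate into a few lines of Python. The function name `find_cccr`, the dictionary-based tree representation, and the example severities are assumptions for illustration; in particular, the parent of code region 7 is invented.

```python
from typing import Dict, List, Set


def find_cccr(children: Dict[int, List[int]],
              severity: Dict[int, int],
              ccr: Set[int]) -> Set[int]:
    """Apply the two refinement rules to every CCR and return the CCCR set."""
    cccr = set()
    for region in ccr:
        kids = children.get(region, [])
        if not kids:
            cccr.add(region)      # rule 1: a leaf CCR is a CCCR
        elif all(severity[region] > severity[k] for k in kids):
            cccr.add(region)      # rule 2: more severe than each child node
    return cccr


# Hypothetical tree and severity categories (0 = very low ... 4 = very high).
children = {1: [4, 6], 2: [7]}
severity = {1: 4, 2: 3, 4: 1, 6: 4, 7: 3}
cores = find_cccr(children, severity, ccr={1, 2, 6, 7})
```

Here code regions 1 and 2 are CCRs but not CCCRs, since each has a child that is at least as severe, which narrows the scope to the leaf regions 6 and 7.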

4.4. Root Cause Analysis

In this section, we introduce the background material of rough set theory, and present the approaches to uncovering the root causes of dissimilarity and disparity bottlenecks, respectively.

4.4.1. The rough set approach [12]

The rough set approach is a data mining method that can be used for classifying vague data. In this paper, we use the rough set approach to uncover the root causes of dissimilarity and disparity bottlenecks.

We start with introducing some basic terms, including information system, decision system, decision table, and core.

An information system is a pair $\mathcal{A} = (U, A)$, where $U$ is a non-empty finite set of objects, called the universe, and $A$ is a non-empty finite set of attributes such that $a: U \to V_a$ for every $a \in A$. The set $V_a$ is called the value set of $a$.

A decision system is any information system of the form $\mathcal{A} = (U, A \cup \{d\})$, where $d \notin A$ is the decision attribute. The elements of $A$ are called conditional attributes.

As shown in Table 2, a decision table is used to describe the decision system. Each entry of a decision table consists of three parts: object ID, conditional attributes, and decision attribute. For example, in Table 2, the set of object IDs is $\{0, \ldots, 3\}$, the set of attributes is $\{a_1, \ldots, a_4\}$, and the set of decisions is $\{N, P\}$.

The core attributes are the attributes that are critical to distinguishing the values of the decision attribute. How to find the core attributes is a main research field in the rough set approach. One of the solutions is to create a discernibility matrix according to the decision table, and then obtain the core attributes using the discernibility matrix as follows:

For a decision system, its decision-relative discernibility matrix is a symmetric $n \times n$ matrix with entries $c_{ij}$ given in Equation (3). Each entry thus consists of the set of attributes upon which $x_i$ and $x_j$ differ [14].

$$c_{ij} = \begin{cases} \{a \in A \mid a(x_i) \neq a(x_j)\} & \text{if } d(x_i) \neq d(x_j) \\ \emptyset & \text{otherwise} \end{cases} \qquad (3)$$

A discernibility function $f_{\mathcal{A}}$ for the decision table $\mathcal{A}$ is a Boolean function of $m$ Boolean variables $a_1, a_2, \ldots, a_m$, defined in Equation (4). For example, for Table 2, the discernibility function is shown in Equation (5).

$$f_{\mathcal{A}}(a_1, \ldots, a_m) = \bigwedge \left\{ \bigvee c_{ij} \mid 1 \le i < j \le n,\ c_{ij} \neq \emptyset \right\} \qquad (4)$$

Table 2: An example of decision table

Figure 3: The discernibility matrix for the decision table in Table 2
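The construction of the decision-relative discernibility matrix can be sketched in Python. The three-row decision table below is invented for illustration and is not Table 2 of the paper; the function name `discernibility_matrix` and the row layout are likewise assumptions.

```python
from typing import Dict, FrozenSet, List, Tuple

# One row of a decision table: (conditional attribute values, decision value).
Row = Tuple[Dict[str, int], str]


def discernibility_matrix(rows: List[Row]) -> Dict[Tuple[int, int], FrozenSet[str]]:
    """Return c_ij for i < j (the matrix is symmetric, so one triangle suffices)."""
    matrix = {}
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            (attrs_i, d_i), (attrs_j, d_j) = rows[i], rows[j]
            if d_i != d_j:
                # Objects with different decisions: collect the attributes that differ.
                entry = frozenset(a for a in attrs_i if attrs_i[a] != attrs_j[a])
            else:
                # Objects with the same decision need not be told apart.
                entry = frozenset()
            matrix[(i, j)] = entry
    return matrix


# Hypothetical decision table with two attributes and decisions N / P.
table = [
    ({"a1": 1, "a2": 0}, "P"),
    ({"a1": 0, "a2": 0}, "N"),
    ({"a1": 1, "a2": 1}, "P"),
]
matrix = discernibility_matrix(table)
```

The non-empty entries of `matrix` are exactly the clauses that enter the discernibility function.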



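The core attributes are the same conjunctive terms shared by the discernibility functions of each object, which are defined in Equation (5).

$$f_{\mathcal{A}}(a_1, a_2, a_3, a_4) = (a_1) \wedge (a_2 \vee a_3) \wedge (a_1 \vee a_4) \wedge (a_2 \vee a_3 \vee a_4) \qquad (5)$$

By absorption, $(a_1) \wedge (a_1 \vee a_4) = a_1$ and $(a_2 \vee a_3) \wedge (a_2 \vee a_3 \vee a_4) = a_2 \vee a_3$, so Equation (5) simplifies to $a_1 \wedge (a_2 \vee a_3)$. According to Equation (5), the same conjunctive terms are therefore $\{a_1, a_2\}$ or $\{a_1, a_3\}$, which are the core attributes of Table 2.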
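The simplification of a discernibility function can also be done mechanically: each clause is a set of attributes, and the sought attribute sets are the minimal sets that intersect every clause. The brute-force sketch below is an illustration, not the authors' procedure; the function name `minimal_attribute_sets` is hypothetical, and the exhaustive search is only practical for a handful of attributes. In the rough set literature such minimal sets are commonly called reducts.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List


def minimal_attribute_sets(clauses: Iterable[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Return all minimal attribute sets that satisfy every clause of the function."""
    clauses = [c for c in clauses if c]       # empty entries impose no constraint
    attributes = sorted(set().union(*clauses)) if clauses else []
    found: List[FrozenSet[str]] = []
    for size in range(len(attributes) + 1):
        for combo in combinations(attributes, size):
            candidate = frozenset(combo)
            if any(prev <= candidate for prev in found):
                continue                      # contains a smaller solution: not minimal
            if all(candidate & clause for clause in clauses):
                found.append(candidate)
    return found


# The four clauses of Equation (5).
equation_5 = [
    frozenset({"a1"}),
    frozenset({"a2", "a3"}),
    frozenset({"a1", "a4"}),
    frozenset({"a2", "a3", "a4"}),
]
result = minimal_attribute_sets(equation_5)
```

For the clauses of Equation (5), `result` contains the two sets {a1, a2} and {a1, a3}, in agreement with the text.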

4.4.2. Root cause analysis

For performance optimization, users need to know the 
root causes of bottlenecks. In this section, we propose the 
rough set theory based approach to uncovering the root 
causes of dissimilarity and disparity bottlenecks, and give 
suggestions for performance improvements. 

As shown in Fig. 4, we create the decision table for dissimilarity bottlenecks as follows: we choose the rank of each process as the object ID. We select L1 cache miss rate, L2 cache miss rate, disk I/O quantity, network I/O quantity and instructions retired as five different attributes $a_k$, $k = 1, \ldots, 5$.

We take the attribute $a_1$ (L1 cache miss rate) as an example. For process $i$, the entry of the decision table corresponding to $a_1$ is obtained as follows:

For the performance vector $V_i$, where $i = 1, \ldots, m$, we assign $T_{ij}$ the L1 cache miss rate of the $j$th code region in process $i$.

After having created the performance vectors, we use the simplified OPTICS clustering algorithm to classify them according to the approach introduced in Section 4.2.1. If the vector of process $i$ is classified into a cluster with the ID of $X$, we assign the entry corresponding to $a_1$ for process $i$ with $X$.

For process $i$, the decision value is the ID of the cluster into which process $i$ is classified according to the metric of the CPU clock time.



[Figure 4 diagram: process rank and accessorial metrics (cache miss rate, I/O quantity, ...) → simplified OPTICS clustering → classification number → decision table (ID, attribute, decision).]

Figure 4: The approach to uncovering the root causes of dissimilarity bottlenecks.

For disparity bottlenecks, we create the decision table 
as follows: 

We use the code region ID to identify each table entry.
We also select L1 cache miss rate, L2 cache miss rate,
disk I/O quantity, network I/O quantity, and the number of
instructions executed as five different attributions.

We take attribution $a_1$ (L1 cache miss rate) as an
example. For code region $j$, the element of the decision
table corresponding to $a_1$ is obtained as follows:

For each code region, we obtain the average L1 cache
miss rate over all processes or threads. We use the K-means
clustering algorithm to classify the average L1 cache miss
rates of the code regions into five categories: very high (4),
high (3), medium (2), low (1), and very low (0). For code
region $j$, if its severity category is higher than medium, we
assign the entry corresponding to the attribution $a_1$ with
1, otherwise 0.

For code region $j$, if it is a disparity bottleneck according
to the approach proposed in Section 4.2.2, then the
decision value is 1, otherwise 0.
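The binarization described above can be illustrated with a short Python sketch. It assumes a plain one-dimensional Lloyd's k-means with evenly spaced initial centers, which may differ from the K-means variant used by AutoAnalyzer; the function names and the sample values are hypothetical.

```python
def kmeans_1d(values, k=5, iterations=100):
    """Plain 1-D Lloyd's k-means with evenly spaced initial centers.

    Because the centers start sorted and stay sorted in one dimension,
    the cluster index doubles as the severity category:
    0 (very low) up to k - 1 (very high).
    """
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assign = None
    for _ in range(iterations):
        new_assign = [min(range(k), key=lambda c: abs(v - centers[c]))
                      for v in values]
        if new_assign == assign:
            break  # assignments are stable, so the centers are too
        assign = new_assign
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = sum(members) / len(members)
    return assign


MEDIUM = 2  # category code for "medium"


def build_disparity_table(avg_metrics, bottleneck_regions):
    """avg_metrics: {attribution: {region ID: average over processes}}.

    Returns {region ID: ({attribution: 0 or 1}, decision value)}.
    """
    region_ids = sorted(next(iter(avg_metrics.values())))
    table = {rid: ({}, int(rid in bottleneck_regions)) for rid in region_ids}
    for name, per_region in avg_metrics.items():
        categories = kmeans_1d([per_region[rid] for rid in region_ids])
        for rid, category in zip(region_ids, categories):
            table[rid][0][name] = int(category > MEDIUM)
    return table


# Made-up averages for six code regions and two of the five attributions.
avg_metrics = {
    "l2_miss_rate": {1: 0.01, 2: 0.02, 3: 0.03, 4: 0.02, 5: 0.09, 6: 0.18},
    "disk_io": {1: 0.1, 2: 0.2, 3: 100.0, 4: 0.1, 5: 0.3, 6: 0.2},
}
table = build_disparity_table(avg_metrics, bottleneck_regions={3, 6})
```

Region 5 lands in the medium category for the L2 miss rate and therefore gets 0, since only categories strictly higher than medium are marked with 1.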

After having created the decision table, we obtain the
core attributions according to the approach proposed in
Section 4.4.1. Since the core attributions are the ones that
have dominant effects on the decision, we consider them
as the root causes of disparity bottlenecks.
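As an illustration, the sketch below extracts core attributions from a small decision table. It assumes the standard rough-set definition (the core is the set of attributions that appear as singleton entries of the decision-relative discernibility matrix); Section 4.4.1 remains the authoritative description of the approach, and the function names and the toy table are hypothetical.

```python
from itertools import combinations


def discernibility_matrix(rows):
    """rows: list of (tuple of attribution values, decision value).

    Returns {(i, j): set of attribution indexes on which objects i and j
    differ}, restricted to pairs whose decision values differ.
    """
    matrix = {}
    for (i, (xi, di)), (j, (xj, dj)) in combinations(enumerate(rows), 2):
        if di != dj:
            matrix[(i, j)] = {k for k, (a, b) in enumerate(zip(xi, xj))
                              if a != b}
    return matrix


def core_attributions(rows):
    """Attributions that are the only way to tell some pair apart."""
    core = set()
    for entry in discernibility_matrix(rows).values():
        if len(entry) == 1:
            core |= entry
    return core


# Toy decision table with five attributions (indexes 0..4 stand for
# a_1..a_5); the values are made up for illustration.
rows = [
    ((0, 0, 0, 0, 0), 0),
    ((0, 1, 0, 0, 0), 1),
    ((0, 0, 1, 0, 0), 1),
    ((1, 0, 0, 0, 1), 0),
]
core = core_attributions(rows)  # indexes 1 and 2, i.e. a_2 and a_3
```

Objects 0 and 1 differ only in the second attribution and objects 0 and 2 only in the third, so removing either attribution would make objects with different decisions indistinguishable.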

5. AutoAnalyzer implementation

In order to evaluate the effectiveness of our proposed
methods, we have designed and implemented a prototype,
AutoAnalyzer. Presently, AutoAnalyzer supports debugging
of performance problems of SPMD-style MPI applications
written in C, C++, FORTRAN 77, and FORTRAN 90. We are
also extending our work to MapReduce [49] and other
data-parallel programming models [i^. Fig. 6 shows the
AutoAnalyzer architecture.

Figure 5: The approach to uncovering the root causes of
disparity bottlenecks.

Figure 6: The AutoAnalyzer Architecture.

The major components of AutoAnalyzer include auto- 
matic instrumentation, data collector, data management, 
and data analysis. 

Automatic instrumentation. On the basis of OMPi [16],
a source-to-source compiler, we have implemented
source-code-level instrumentation. Without human
involvement, our tool uses source-to-source transformation
to automatically insert instrumentation code. After
parsing the program, the system builds the abstract
syntax tree (AST). The AST exposes the program's structure
information, e.g., the beginning and end of functions,
procedures, or loops. With this structure information, our
tool can automatically insert instrumentation code and
divide a program into code regions.

Our tool supports several instrumentation modes: outer
loop, inner loop, mathematical library, parallel interface
library like MPI, system call, C/FORTRAN library, and
user-defined functions or procedures. Without any
restrictions on instrumentation, a program can be divided
into hundreds or thousands of code regions. For example,
after instrumentation, a parallel program of 2,000 lines is
divided into more than 300 code regions. This situation
has a negative influence on the performance analysis because
AutoAnalyzer needs to collect and analyze a large amount
of performance data. To decrease the size of performance
data, we propose two solutions. First, we adopt two rounds
of analysis: in the first round, we divide a parallel
program into coarse-grained code regions, e.g., per function,
for roughly locating bottlenecks; in the second round, we
divide the code regions that are possible bottlenecks into
fine-grained code regions, e.g., loops. Second, users can
selectively choose one or more modes to instrument the
code, or interact with the GUI of the tool to eliminate,
merge, and split code regions.

Data collector. We collect performance data from four 
hierarchies: application, parallel interface, operating sys- 
tem, and hardware. 

In the application hierarchy, we collect the wall clock
time and the CPU clock time of each code region. In the
parallel interface hierarchy, we have implemented an MPI
library wrapper to record the behavior of MPI routines for
both point-to-point and collective communication. The
wrapper is built on the MPI standard profiling interface
(PMPI). In the wrapper, we add instrumentation code to
collect performance data of the MPI library, e.g., the
execution time and the quantity of data transferred in the
MPI library.

In the operating system hierarchy, we use SystemTap
(http://sourceware.org/systemtap/) to monitor disk
I/O, recording the execution time and the quantity of data
read and written in I/O operations. SystemTap is based
on Kprobe, which is implemented in the Linux kernel.
Kprobe can instrument the system calls of the Linux
kernel to obtain the execution time and function parameters
as well as the I/O quantity.

In the hardware hierarchy, we use PAPI
(http://icl.cs.utk.edu/papi/) to count hardware
events, including L1 cache misses, L1 cache accesses, L2
cache misses, L2 cache accesses, and instructions retired.

Data management. We collect all performance data 
on different nodes and send them to one node for analysis. 
All data are stored in XML files. 
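The paper does not give the XML layout, so the following Python sketch only illustrates the idea of writing per-node records to XML and merging them on the analysis node. The element names, attribute names, and record fields are all hypothetical and are not AutoAnalyzer's actual file format.

```python
import xml.etree.ElementTree as ET


def records_to_xml(node_name, records):
    """Serialize the records collected on one node (hypothetical schema)."""
    root = ET.Element("performance_data", node=node_name)
    for rec in records:
        region = ET.SubElement(root, "code_region",
                               id=str(rec["region"]), rank=str(rec["rank"]))
        for metric, value in rec["metrics"].items():
            ET.SubElement(region, "metric", name=metric).text = repr(value)
    return ET.tostring(root, encoding="unicode")


def merge_xml(documents):
    """Gather the code-region records of several per-node documents."""
    merged = []
    for doc in documents:
        root = ET.fromstring(doc)
        for region in root.iter("code_region"):
            merged.append({
                "node": root.get("node"),
                "region": int(region.get("id")),
                "rank": int(region.get("rank")),
                "metrics": {m.get("name"): float(m.text)
                            for m in region.iter("metric")},
            })
    return merged


# Made-up record for one code region of one process.
doc = records_to_xml("node01", [
    {"region": 4, "rank": 3,
     "metrics": {"l2_miss_rate": 0.05, "wall_time": 12.5}},
])
merged = merge_xml([doc])
```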

Data analysis. We analyze the performance data of code
regions so as to search for bottlenecks and uncover their
root causes.

Before using AutoAnalyzer, users need to perform the
following setup work. Before installing PAPI, they must
make sure that the kernel has been patched and recompiled
with the PerfCtr or Perfmon patch. Then they can
compile the PAPI source code to install it. SystemTap
also depends on several packages: the kernel-debuginfo
and kernel-debuginfo-common RPMs, and the kernel-devel
RPM. Users need to install these packages before
installing SystemTap. However, with the support of a
state-of-the-practice operating system deployment tool
like SystemImager, which is open source, we can automate
the deployment of AutoAnalyzer.

6. Evaluation

In this section, we use two production parallel
applications, written in Fortran 77, and one open-source
parallel application, written in C++, to evaluate the
correctness and effectiveness of AutoAnalyzer.

The first program is ST, which calculates seismic
tomography using a refutations method. ST is in
production use at the largest oil company in China.
Fig. 7 shows the model obtained with ST. The second
one is the parallel NPAR1WAY module of SAS. SAS
is a system widely used in data and statistical analysis.
The third one is MPIBZIP2, a parallel implementation
of the bzip2 block-sorting file compressor that uses
MPI and achieves significant speedup on cluster machines
(http://compression.ca/mpibzip2/).

In Section 6.1, Section 6.2, and Section 6.3, for the three
applications, we choose the CPU clock time as the main
performance measurement for searching for dissimilarity
bottlenecks, and our proposed CRNM as the main
performance measurement for disparity bottlenecks. In
Section 6.4, we investigate the effects of different metrics
on locating bottlenecks.

6.1. ST 

In this section, we use ST, a production parallel
application of 4307 lines of code, to evaluate the
effectiveness of our system. To identify a problem, a user
of our tool needs to do little to get started. The tool
automatically instruments the code. After analysis, the
tool informs the user about bottlenecks and their root
causes. For ST, it took about 2 days for a master's student
in our lab to locate bottlenecks and rewrite about 200
lines to optimize the code.




Figure 7: The model obtained with ST. 

Our testbed is a small-scale cluster system connected
with a 1000 Mbps network. Each node has two processors,
each of which is an AMD Opteron with 64KB L1 data cache,
64KB L1 instruction cache, and 1MB L2 cache. The OS
version is Linux 2.6.19.

Figure 8: The code region tree of ST. Code regions 11 and 12
are in subroutine ramodS, which is nested in code region
14. All code regions contain loops.

In the rest of this section, we give the details of
locating bottlenecks and optimizing performance. Section 6.1.1
reports a case study of ST with coarse-grain code regions
for locating bottlenecks and optimizing the application.
Section 6.1.2 reports a case study of ST with fine-grain code
regions.



6.1.1. Locating bottlenecks and optimizing the applications 

To reduce the number of code regions, AutoAnalyzer
supports an instrumentation mode that allows a user to
select whether to instrument functions or procedures, or
outer loops. In this subsection, we instrument ST into 14
coarse-grain code regions, and Fig. 8 shows the code region
tree. For ST, a configuration parameter, the shot number,
decides the amount of data input. For this experiment,
the shot number is 627.

According to the similarity analysis approach proposed
in Section 4.3, AutoAnalyzer outputs the analysis result
for each process behavior of ST, which is shown in Fig. 9.
We can find that all processes are classified into five
clusters. For an SPMD program, the analysis results
indicate that dissimilarity bottlenecks exist. According to the
searching result, we can conclude that code region 11 and
code region 14 are CCRs. Since code region 11 is the child
node of code region 14, we consider code region 11 as a
CCCR, which is the location of the problem.

We create the decision table to analyze the root causes
of code region 11.

Table 3 shows the decision table. In the decision table,
the attributions $a_k$, $k = 1, 2, 3, 4, 5$, represent L1 cache
miss rate, L2 cache miss rate, disk I/O quantity, network
I/O quantity, and instructions retired, respectively. Fig. 10
shows the discernibility matrix.

Figure 9: The analysis results of similarity measurement.

Table 3: Decision table for the dissimilarity bottlenecks

Figure 10: The discernibility matrix for Table 3.



According to the approach proposed in Section 4.4.1, we
find that $a_5$ is the core attribution, which indicates that
the variance of instructions retired in different processes is
the root cause of the dissimilarity bottleneck in code
region 11.

Fig. 11 verifies our analysis: it shows obvious differences
in the instructions retired of code region 11 among
different processes.

Using the K-means clustering approach, AutoAnalyzer
outputs the analysis result for each code region of ST,
which is shown in Fig. 12. The severity degrees of code
region 14, code region 11, and code region 8 are higher
than medium. According to the analysis result,
we confirm that code region 14, code region 11, and code
region 8 are CCRs. Since code region 11 is nested within
code region 14 and the severity degree of code region 11 is
the same as that of code region 14, code region 11 is a
CCCR. Since no code region is nested in code region 8,
code region 8 is also a CCCR. We focus on code region 8
and code region 11 for performance optimization.

Figure 11: The variance of instructions retired of code
region 11 in different processes.

very high: code regions 14, 11
medium: code regions 5, 6
very low: code regions 1, 9, 3, 7, 10, 12, 13, 4

Figure 12: The analysis results of the k-means clustering
approach.

Figure 13: The average CRNM of each code region.

We analyze the root causes of disparity bottlenecks with
the rough set approach. The decision table is shown in
Table 4. In the decision table, the attributions $a_k$,
$k = 1, 2, 3, 4, 5$, represent L1 cache miss rate, L2 cache
miss rate, disk I/O quantity, network I/O quantity, and
instructions retired, respectively.

According to the approach proposed in Section 4.4.1, we
find that $\{a_2, a_3\}$ are the core attributions, which
indicates that high L2 cache miss rate and high disk I/O
quantity are the root causes of disparity bottlenecks. Then
we search the decision table and find that the root cause of
code region 8 is high disk I/O quantity and the root cause
of code region 11 is high L2 cache miss rate. From the
performance data, we can observe that the disk I/O quantity
of code region 8 is as high as 106G and the L2 cache miss
rate of code region 11 is as high as 17.8%.

Table 4: Decision table used for searching disparity
bottlenecks.

In order to eliminate the dissimilarity bottleneck in code
region 11, we replace the static load dispatching in the
master process, adopted in the original program, with a
dynamic load dispatching mode. After the optimization,
we use AutoAnalyzer to analyze the optimized code again.
The analysis results show that all processes, excluding the
code regions in the master process responsible for the
management routines, are classified into one cluster,
indicating that all processes have similar performance with
balanced workloads.

We take the following approaches to optimizing the
disparity bottlenecks, code region 8 and code region 11.
First, we improve code region 8 by buffering as much data
as possible in memory. Second, we improve the data
locality of code region 11 by breaking the loops into small
ones and rearranging the data storage.

We use AutoAnalyzer to analyze the optimized code
again. The new analysis results show that code region 8 is
no longer a disparity bottleneck, while code region 11 is
still a disparity bottleneck, but the average CRNM value of
code region 11 decreases from 0.41 to 0.26. The root cause
of code region 11 is no longer the high L2 cache miss rate,
but the large quantity of instructions retired.

Fig. 14 shows the performance of ST before and after the
optimization. With the disparity bottlenecks eliminated,
the performance of ST rises by 90% in comparison with
the original program. With the dissimilarity bottlenecks
eliminated, the performance of ST rises by 40% in
comparison with the original program. With both disparity
and dissimilarity bottlenecks eliminated, the performance
of ST rises by 170% in comparison with the original program.

Figure 14: ST performance before and after the
optimization.



6.1.2. A case study of ST with fine-grain code regions

In this subsection, on the basis of the code region tree
shown in Fig. 8, we divide the program into fine-grained
code regions, which are shown in Fig. 15. To save time,
we choose a shot number of 300, and the run time of the
application is about 9815.52454 seconds. Please note that,
with the exception of newly added code regions, the
code regions in Fig. 8 and Fig. 15 keep the same IDs.

We use the simplified OPTICS clustering algorithm to
find dissimilarity bottlenecks. From the analysis result,
we can find that code region 14, code region 11, and code
region 21 are CCRs. Since code region 21 is nested within
code region 11 and the latter is also nested within code
region 14, we confirm that code region 21 is a CCCR,
which is the location of the problem.
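The way a CCCR is singled out among nested CCRs in this section can be sketched as below. This is a simplified reading of the search, assuming a CCR is kept as a CCCR only when none of its descendants in the code region tree is also a CCR; it ignores the comparison of severity degrees used for disparity bottlenecks. The `parent` map contains only the nesting relations stated in the text and treats the other regions as top-level, which is an assumption.

```python
def find_cccrs(parent, ccrs):
    """Keep the CCRs that have no CCR among their descendants.

    parent: {region ID: parent region ID, or None for a top-level region}
    ccrs:   set of region IDs flagged as CCRs
    """
    def ancestors(region):
        region = parent.get(region)
        while region is not None:
            yield region
            region = parent.get(region)

    # A CCR that is an ancestor of another CCR is shadowed by it.
    shadowed = {anc for region in ccrs for anc in ancestors(region)
                if anc in ccrs}
    return set(ccrs) - shadowed


# Nesting stated in the text: 11 and 12 are inside 14; 21 is inside 11.
parent = {14: None, 8: None, 11: 14, 12: 14, 21: 11}

coarse_dissimilarity = find_cccrs(parent, {11, 14})   # region 11
coarse_disparity = find_cccrs(parent, {8, 11, 14})    # regions 8 and 11
fine_dissimilarity = find_cccrs(parent, {14, 11, 21})  # region 21
```

With these inputs the sketch returns the same CCCRs as reported for the coarse-grain and fine-grain case studies.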

From Fig. 8 and Fig. 15, we can observe that the newly
identified dissimilarity bottleneck, code region 21, is nested
within code region 11, which is identified as a
dissimilarity bottleneck in Section 6.1.1 when a coarse-grain
code region tree is adopted.

Figure 15: The refined code region tree.

We also use the k-means clustering approach to locate
disparity bottlenecks. From the analysis results, we
conclude that code region 19 and code region 21 are
disparity bottlenecks.

Figure 16: The variance of instructions retired of code
region 21 in different processes.

Figure 17: The average CRNM of each code region in eight
processes.



From Fig. 8 and Fig. 15, we can observe that the newly
identified disparity bottlenecks, code region 19 and code
region 21, are nested within code region 8 and code region
14, respectively, which are identified as disparity
bottlenecks in Section 6.1.1 when a coarse-grain code region
tree is adopted. These results show that our two-round
analysis, introduced in Section 5, can indeed refine the scope
of both dissimilarity bottlenecks and disparity bottlenecks.
Fig. 16 shows the variance of instructions retired of code
region 21 in different processes.

6.2. NPAR1WAY

NPAR1WAY is a module of the SAS (Statistical Analysis
System), which is responsible for reading, writing,
managing, analyzing, and displaying data. SAS is widely used
in data and statistical analysis. The parallel NPAR1WAY
module uses MPI to calculate the exact p-value to achieve
high performance. AutoAnalyzer divides the whole
program into 12 code regions to separate functions,
subroutines, and outer loops.

Our testbed is a small-scale cluster system. Each node
has two processors, each of which is a 2 GHz quad-core
Intel Xeon E5335 processor with 128KB L1 data cache,
128KB L1 instruction cache, and 8 MB L2 cache. The
operating system is Linux 2.6.19.

6.2.1. Bottleneck detection 

The analysis results of AutoAnalyzer show that all
processes are classified into one cluster, which indicates that
no dissimilarity bottleneck exists. AutoAnalyzer also
analyzes the application performance from the perspective of
each code region. The analysis results show that the severity
degrees of code region 3 and code region 12 are higher than
medium, and we consider them as CCRs. Because there are
no nested code regions in code region 3 and code region 12,
both code regions are CCCRs, which we consider
disparity bottlenecks.

We also use the rough set approach to uncover the root
causes of disparity bottlenecks. In the decision table, the
attributions $a_k$, $k = 1, 2, 3, 4, 5$, represent L1 cache miss
rate, L2 cache miss rate, disk I/O quantity, network I/O
quantity, and instructions retired, respectively.

Through analyzing the discernibility matrix, we
conclude that $\{a_4, a_5\}$ are the core attributions, which
indicates that both high network I/O quantity and a high
quantity of instructions retired are root causes of the
disparity bottlenecks. Then we search the decision table and
find that code region 3 has a high quantity of instructions
retired, while code region 12 has both a high quantity of
instructions retired and high network I/O quantity. From
the performance data, we can see that the instructions
retired of code region 3 and code region 12 take up 26% and
60% of the total instructions retired of the program,
respectively. At the same time, the network I/O quantity of
code region 12 takes up 70% of the total network I/O
quantity of the program.



6.2.2. The performance optimization 

According to the root causes uncovered by AutoAnalyzer,
we optimize the code to eliminate the disparity
bottlenecks. The performance of NPAR1WAY rises by 20%
after the optimization.

We optimize code region 3 and code region 12 by
eliminating redundant common expressions. For example, there
is one common multiply expression occurring three times
in code region 3. We use one variable to store the result of
the multiply expression at its first appearance, and later
directly use the variable to avoid subsequent redundant
computation. In this way, we can remove a large number of
instructions by eliminating redundant common expressions
in deep loops.
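The transformation can be illustrated with a small sketch. The original code is Fortran 77 and is not shown in the paper, so the loop body, the function names, and the data below are hypothetical; only the pattern (a multiply expression that appears three times and is stored once) follows the description above.

```python
def region_before(xs, w):
    """Original form: the product x * w is evaluated three times per pass."""
    total = 0
    for x in xs:
        total += (x * w) + (x * w) * (x * w)
    return total


def region_after(xs, w):
    """Optimized form: the product is stored at its first appearance."""
    total = 0
    for x in xs:
        p = x * w  # computed once, then reused
        total += p + p * p
    return total


samples = [1, 2, 3]
result_before = region_before(samples, 2)
result_after = region_after(samples, 2)
```

Both functions return the same value, while the second performs one multiplication by `w` per iteration instead of three.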

Then we analyze the code again. For the optimized code
region 3, the analysis results show that the quantity of
instructions retired and the wall clock time are reduced by
36.32% and 20.33%, respectively. For the optimized code
region 12, the analysis results show that the instructions
retired and the wall clock time are reduced by 16.93% and
8.46%, respectively. For code region 12, we fail to
eliminate the high network I/O quantity.

6.3. Analysis of an open source application: MPIBZIP2

MPIBZIP2 is a parallel implementation of the bzip2
block-sorting file compressor that uses MPI and achieves
significant speedup on cluster machines. The output is
fully compatible with the regular bzip2 data, so any files
created with MPIBZIP2 can be uncompressed by bzip2
and vice versa. This software is open source and
distributed under a BSD-style license. AutoAnalyzer divides
the whole program into 16 code regions to separate
functions, subroutines, and outer loops. Fig. 18 shows the code
region tree. Our testbed is just the same as that in Section

Figure 18: The code region tree of MPIBZIP2.

Excluding the code regions that are responsible for
management routines in the master process, we use the
simplified OPTICS clustering algorithm to find dissimilarity
bottlenecks. From the analysis result, we find that all
processes are classified into one cluster, and we confirm that
there are no dissimilarity bottlenecks in MPIBZIP2. We also
use the K-means clustering approach to analyze the
disparity bottlenecks. The analysis results show that the
severity degrees of code region 6 and code region 7 are higher
than medium, and we consider them as CCRs. Since there
are no nested code regions in code region 6 and code region
7, both code regions are CCCRs, which we consider
disparity bottlenecks. Fig. 19 shows the average CRNM of
each code region of MPIBZIP2.

Figure 19: The average CRNM of each code region of
MPIBZIP2.



We uncover the root causes of disparity bottlenecks with
the rough set approach. In the decision table, the
attributions $a_k$, $k = 1, 2, 3, 4, 5$, represent L1 cache miss
rate, L2 cache miss rate, disk I/O quantity, network I/O
quantity, and instructions retired, respectively. Through
analyzing the discernibility matrix, we conclude that
$\{a_4, a_5\}$ are the core attributions, which indicates that
network I/O quantity and instructions retired are root
causes of the disparity bottlenecks. Then we search the
decision table and find that the root cause of code region 6
is a high quantity of instructions retired and the root cause
of code region 7 is high network I/O quantity. From the
performance data, we also observe that the instructions
retired of code region 6 take up 96% of the total instructions
retired of the program. At the same time, the network I/O
quantity of code region 7 takes up 50% of the total network
I/O quantity of the program.

Through reading the source code, we found that code
region 6 calls the BZ2_bzBuffToBuffCompress() function
to compress the data. BZ2_bzBuffToBuffCompress() is a
third-party function packaged in the static library
libbz2.a of bzip2. Code region 7 calls MPI_Send() to
send the compressed data to the master process. These
two bottlenecks are difficult to optimize. For the first
bottleneck, we would need to improve the mature
compression algorithm; for the second bottleneck, we would
need to decrease the data transferred to the master process,
but the data has already been compressed. We fail to
optimize the code.

6.4. Effect of different metrics on bottleneck detection

For the three applications, we investigate the effect of
different metrics on locating bottlenecks. For ST, NPAR1WAY,
and MPIBZIP2, the number of code regions is 14, 12, and
16, respectively. For ST, we perform the experiments on
the same testbed as that in Section 6.1, but the shot
number is changed from 627 to 300 to save time. For the two
other applications, the testbed is the same as that in
Section 6.2.

We choose the CRNM value, the CPI, and the wall clock 
time of each code region as the main performance measure- 
ment to locate disparity bottlenecks, respectively. 

Our experiment shows CRNM is more valuable than
CPI or the wall clock time for locating disparity
bottlenecks. For example, for ST, using CRNM, AutoAnalyzer
identifies code region 8, code region 11, and code region
14 as CCRs, and we significantly improve the application
performance through optimizing them, as shown in
Section 6.1.1. Using the average wall clock time of each code
region, AutoAnalyzer identifies code regions 2, 5, 6, and 10
as disparity bottlenecks in addition to code regions 8, 11,
and 14. From Fig. 20, we can observe that code regions 2, 5,
6, and 10 take up a trivial proportion of the running time of
the application. Using CPI, AutoAnalyzer identifies code
regions 2 and 8 as disparity bottlenecks, while code region
11 and code region 14, which take up most of the running
time of the application, are ignored.
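The difference between ranking by CPI and ranking by CRNM can be illustrated with a small sketch. It assumes that CRNM is the CPI of a code region weighted by the region's share of the program's wall clock time, which is how this section characterizes the metric; the authoritative definition is the one given earlier in the paper. The function names and the measurements are made up.

```python
def cpi(cycles, instructions):
    """Cycles per instruction of a code region."""
    return cycles / instructions


def crnm(region_wall_time, program_wall_time, cycles, instructions):
    """ASSUMED form: CPI weighted by the region's share of wall clock time."""
    return (region_wall_time / program_wall_time) * cpi(cycles, instructions)


# Made-up measurements: (wall clock time in s, cycles, instructions retired)
regions = {
    "A": (1.0, 8_000, 1_000),   # poor CPI, negligible running time
    "B": (60.0, 3_000, 2_000),  # moderate CPI, dominates the running time
}
program_wall_time = 100.0

worst_by_cpi = max(regions, key=lambda r: cpi(*regions[r][1:]))
worst_by_crnm = max(
    regions,
    key=lambda r: crnm(regions[r][0], program_wall_time, *regions[r][1:]),
)
```

Here CPI alone singles out region A, whereas the weighted metric singles out region B, which mirrors the observation that CPI ignores the code regions that take up most of the running time.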
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Figure 20: The average wall clock time and CPU clock 
time of each code region of ST.

Fig. 20, Fig. 21, and Fig. 22 show the average wall clock
time and CPU clock time, the average CRNM, and the CPI
of each code region of ST, respectively.

Figure 21: The average CRNM of each code region of ST.

Figure 22: The average CPI of each code region of ST.

CRNM is more valuable than CPI or wall clock time for
locating disparity bottlenecks for the following two
reasons. First, by using the ratio of the wall clock time of
a code region to the wall clock time of the whole program,
our metric can judge the contribution of a code region to
the overall performance of a program. Second, CPI
measures the efficiency of instruction execution. Derived
from the total instructions retired and the total executing
cycles, CPI is a basic metric that reflects all hardware
events: cache or TLB misses, cache line invalidation,
pipeline stalls caused by data dependencies or branch
misprediction, and so on. So our normalized CPI represents
a measurement of the importance of a code region to the
overall performance of the application.

We choose the wall clock time and the CPU clock time as
the main measurement to locate dissimilarity bottlenecks,
respectively. For the three applications, we use each of the
two metrics to locate dissimilarity bottlenecks. As
an example, Fig. 20 compares the average wall clock time
and the average CPU clock time of each code region of
ST, and Fig. 23 shows the wall clock time and the CPU
clock time of code region 11 of ST in different processes,
which is identified as a dissimilarity bottleneck in Section
6.1.1. Though the two measurements differ somewhat,
our results show that they have the same effect on locating
dissimilarity bottlenecks.

[Figure 23 plot: wall clock time and CPU clock time versus process rank]

Figure 23: The wall clock time and CPU clock time of
code region 11 of ST in different processes.


7. Conclusions 

This paper presented a series of innovative methods for
automatic performance debugging of SPMD-style parallel
programs. For SPMD-style parallel applications, we
utilized two effective clustering algorithms to investigate
the existence of two types of bottlenecks: dissimilarity
bottlenecks that cause process behavior dissimilarity and
disparity bottlenecks that cause code region behavior
disparity; for the case where bottlenecks exist, we presented
two searching algorithms to locate them. On the basis of
rough set theory, we proposed an innovative approach to
automatically uncovering root causes of bottlenecks. We
designed and implemented AutoAnalyzer. On cluster
systems with two different configurations, we used two
production applications and one open source code,
MPIBZIP2, to verify the effectiveness and correctness of
our methods. Meanwhile, we also investigated the effects
of different metrics on locating bottlenecks. Our
experimental results showed that, for the three applications,
our proposed metric, CRNM, outperforms CPI and wall
clock time in terms of locating disparity bottlenecks, and
that the wall clock time and the CPU clock time have the
same effect on locating dissimilarity bottlenecks.

In the near future, we will extend our method to more
generalized parallel applications beyond the SPMD style.